Data Flattening and Schema Management
Data Flattening and Normalization
DataStori supports API responses in JSON and XML formats.
Both JSON and XML can be deeply nested structures that must be flattened before their data is suitable for analysis and reporting.
By default, DataStori flattens all JSON and XML files and loads them into database tables for downstream consumption. The two examples below illustrate data flattening.
Example 1: Simple JSON API Response
Incoming JSON (API response):
[
  {
    "name": "John Doe",
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA"
    }
  },
  {
    "name": "Alice",
    "address": {
      "street": "456 Cross St",
      "city": "TownTown",
      "state": "NY"
    }
  }
]
Flattened JSON:
name,address.street,address.city,address.state
John Doe,123 Main St,Anytown,CA
Alice,456 Cross St,TownTown,NY
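The dot-notation flattening shown above can be sketched with a short recursive helper. This is an illustrative Python sketch, not DataStori's actual implementation:

```python
def flatten(record, parent_key=""):
    """Flatten nested dicts into one level using dot-notation keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

records = [
    {"name": "John Doe",
     "address": {"street": "123 Main St", "city": "Anytown", "state": "CA"}},
    {"name": "Alice",
     "address": {"street": "456 Cross St", "city": "TownTown", "state": "NY"}},
]

rows = [flatten(r) for r in records]
# rows[0] -> {"name": "John Doe", "address.street": "123 Main St",
#             "address.city": "Anytown", "address.state": "CA"}
```

Each flattened row maps directly onto one line of the CSV-style output above.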
Example 2: Array Within a JSON API Response
Suppose an employee can work on multiple projects, and there is a need to read and report on both employee and employee-project information.
In this case, DataStori splits the JSON response into two separate tables - EMPLOYEE and EMPLOYEE_PROJECTS - which are linked using 'employee.name' as the key.
DataStori flattens arrays to a depth of 1, i.e. any sub-array within projects is not flattened, but is converted to a string and retained.
Incoming JSON (API response):
{
"employees": [
{
"employee": {
"name": "Alice",
"department": {
"name": "HR",
"location": "Building A"
},
"projects": [
{
"name": "Employee Onboarding",
"status": "In Progress"
},
{
"name": "Benefits Administration",
"status": "Completed"
}
]
}
},
{
"employee": {
"name": "Bob",
"department": {
"name": "Engineering",
"location": "Building B"
},
"projects": [
{
"name": "Product Development",
"status": "In Progress"
},
{
"name": "Quality Assurance",
"status": "Completed"
}
]
}
},
{
"employee": {
"name": "Charlie",
"department": {
"name": "Finance",
"location": "Building C"
},
"projects": [
{
"name": "Budget Planning",
"status": "In Progress"
},
{
"name": "Financial Reporting",
"status": "Completed"
}
]
}
}
]
}
Flattened EMPLOYEE JSON:
employee.name, employee.department.name, employee.department.location
Alice, HR, Building A
Bob, Engineering, Building B
Charlie, Finance, Building C
Flattened EMPLOYEE_PROJECTS JSON:
employee_name, name, status
Alice, Employee Onboarding, In Progress
Alice, Benefits Administration, Completed
Bob, Product Development, In Progress
Bob, Quality Assurance, Completed
Charlie, Budget Planning, In Progress
Charlie, Financial Reporting, Completed
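The parent/child split above can be sketched as follows. This is a hand-written illustration of the behavior, under the assumption that sub-arrays below depth 1 are serialized to JSON strings; it is not DataStori's actual code:

```python
import json

def split_employees(payload):
    """Split the nested response into EMPLOYEE and EMPLOYEE_PROJECTS rows,
    linked by the employee name."""
    employees, projects = [], []
    for entry in payload["employees"]:
        emp = entry["employee"]
        employees.append({
            "employee.name": emp["name"],
            "employee.department.name": emp["department"]["name"],
            "employee.department.location": emp["department"]["location"],
        })
        for proj in emp.get("projects", []):
            row = {"employee_name": emp["name"]}
            for key, value in proj.items():
                # Depth-1 rule: any nested list is stringified, not flattened.
                row[key] = json.dumps(value) if isinstance(value, list) else value
            projects.append(row)
    return employees, projects

emp_rows, proj_rows = split_employees({
    "employees": [{
        "employee": {
            "name": "Alice",
            "department": {"name": "HR", "location": "Building A"},
            "projects": [{"name": "Employee Onboarding", "status": "In Progress"}],
        }
    }]
})
```

Running this over the full payload above yields the two tables shown, with `employee_name` acting as the link back to the EMPLOYEE table.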
Schema Management
There are two data schemas to be managed:
- Incoming data - JSON or XML API response, CSV file attached to email, or input SQL table
- Destination database table to which data is being written
Incoming Data Schema
Over time, a source application may add or remove columns in its API response, or in the other data input modes.
Unless monitored and acted upon, such changes will cause the associated data pipeline to fail, impacting all downstream processing and usage of the data.
DataStori automatically adjusts the data pipeline to include these schema changes and logs them to track schema evolution. This is essential for audit, and for any future actions related to a modified data payload.
Column deletion: The deleted column is retained in the data pipeline, and its value is set to NULL for all new incoming data.
Column addition: The new column is added to the data pipeline, and its value is set to NULL for all legacy data.
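The two rules above amount to reconciling each incoming record against the tracked schema. A minimal sketch, assuming the schema is tracked as an ordered list of column names (illustrative only, not DataStori's implementation):

```python
def reconcile(schema, record):
    """Align an incoming record with the tracked pipeline schema.

    New columns are appended to the schema (legacy rows would read as
    NULL for them); deleted columns stay in the schema and surface as
    None for new data.
    """
    added = [col for col in record if col not in schema]
    schema.extend(added)  # column addition: schema grows, change is logged
    # Column deletion: missing keys fall back to None (NULL downstream).
    row = {col: record.get(col) for col in schema}
    return row, added

schema = ["name", "city"]
# The source dropped "city" and added "email".
row, added = reconcile(schema, {"name": "Alice", "email": "alice@example.com"})
```

Here `added` doubles as the change log entry for the evolved schema.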
Users can view the JSON schema in the schema folder in blob storage.
Currently, data type changes are not tracked. For example, if an integer column is converted to a string column, DataStori will not capture this change.
Destination Database Schema
The schema of the destination database tables is prepared by converting the JSON or XML schema to a database schema, using SQLAlchemy schema metadata.
The destination database schema in DataStori automatically evolves with the incoming JSON or XML schema.
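Destination-side evolution boils down to adding columns as they appear in the incoming schema, while legacy rows read back as NULL. A minimal sketch using the stdlib sqlite3 module for illustration (DataStori itself uses SQLAlchemy schema metadata, and the table and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, department TEXT)")
conn.execute("INSERT INTO employee VALUES ('Alice', 'HR')")

# A new field appears in the incoming JSON schema: evolve the
# destination table to match. Existing rows read back as NULL.
incoming_columns = {"name", "department", "location"}
existing = {info[1] for info in conn.execute("PRAGMA table_info(employee)")}
for col in incoming_columns - existing:
    conn.execute(f"ALTER TABLE employee ADD COLUMN {col} TEXT")

row = conn.execute("SELECT name, location FROM employee").fetchone()
# row == ('Alice', None)
```

The same add-only pattern keeps the destination schema in step with the incoming JSON or XML schema without rewriting historical data.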